Abstract

Background: Many problems in protein modeling require obtaining a discrete representation of the protein 
conformational space as an ensemble of conformations. In ab-initio structure prediction, in particular, where 
the goal is to predict the native structure of a protein chain given its amino-acid sequence, the ensemble 
needs to satisfy energetic constraints. Given the thermodynamic hypothesis, an effective ensemble contains 
low-energy conformations which are similar to the native structure. The high-dimensionality of the 
conformational space and the ruggedness of the underlying energy surface currently make it very difficult to 
obtain such an ensemble. Recent studies have proposed that Basin Hopping is a promising probabilistic search 
framework to obtain a discrete representation of the protein energy surface in terms of local minima. Basin 
Hopping performs a series of structural perturbations followed by energy minimizations with the goal of 
hopping between nearby energy minima. This approach has been shown to be effective in obtaining 
conformations near the native structure for small systems. Recent work by us has extended this framework to 
larger systems through employment of the molecular fragment replacement technique, resulting in rapid 
sampling of large ensembles. 

Methods: This paper investigates the algorithmic components in Basin Hopping to both understand and control 
their effect on the sampling of near-native minima. Realizing that such an ensemble is reduced before further 
refinement in full ab-initio protocols, we take an additional step and analyze the quality of the ensemble retained 
by ensemble reduction techniques. We propose a novel multi-objective technique based on the Pareto front to 
filter the ensemble of sampled local minima. 

Results and conclusions: We show that controlling the magnitude of the perturbation allows directly controlling 
the distance between consecutively-sampled local minima and, in turn, steering the exploration towards 
conformations near the native structure. For the minimization step, we show that the addition of Metropolis Monte 
Carlo-based minimization is no more effective than a simple greedy search. Finally, we show that the size of the 
ensemble of sampled local minima can be effectively and efficiently reduced by a multi-objective filter to obtain a 
simpler representation of the probed energy surface.

Background

Many problems in protein modeling demand obtaining a 
discrete representation of the protein conformational 
space in terms of an ensemble of conformations. In the 
ab-initio structure prediction problem, in particular, where 
the goal is to predict the native structure of a protein 
chain given its amino-acid sequence, the ensemble needs 
to satisfy certain energetic constraints. Under the thermo- 
dynamics treatment [1], the native structure is located at 
the basin of a funnel-like energy surface [2,3]. Thus, search
algorithms that generate conformations and are guided 
towards low-energy ones by a potential energy function 
should obtain an effective ensemble containing low-energy 
conformations near the native structure. This is predomi- 
nantly not the case due to the size and high-dimensionality 
of the protein conformational space and the ruggedness of 
the underlying energy surface [4]. Despite these challenges, 
the rapidly growing gap between the wealth of protein 
sequence data and the relatively sparse set of experimen- 
tally-determined native protein structures necessitates 
research into computational approaches to determining 
protein structure. The ability to determine structural
information through ab-initio computational methods promises
to elucidate the relationship between protein structure and
function and advance studies of biological function and
drug design [5-7].

The two predominant reasons that it is challenging to 
obtain a conformational ensemble near the (unknown) 
native structure of a protein are poor sampling capability 
by the search algorithm and inaccuracies in the energy 
function employed by this algorithm to probe low-energy 
regions of the energy surface. Limited sampling capability 
is to be expected when considering a vast high-dimen- 
sional search space. For the purpose of illustrating this 
point, consider a protein chain of n amino acids. Each 
amino acid contains a group of atoms. A shared subset 
among all known amino acids, known as backbone 
atoms, defines the main backbone thread that runs 
through the protein chain. Even if focusing on modeling 
only this thread and its spatial arrangements, which we 
refer to as conformations, the space populated by these 
conformations has many dimensions. There are 4 heavy 
backbone atoms per amino acid. A cartesian representation
would define a (4 * 3n)-dimensional space. One can
reduce this down to a 3n- or a 2n-dimensional space if,
instead of maintaining cartesian coordinates, only backbone
dihedral angles are maintained to represent a
conformation. For a small protein of 30 amino acids, the
conformational space has at least 60 dimensions in this 
angular representation. 
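
As a quick check of the dimension counts above, the following sketch tabulates them for a chain of n amino acids. The function name is hypothetical, and reading the 2n case as "only two dihedral angles per amino acid are varied, with the third held fixed" is an assumption, not something stated in the text.

```python
def backbone_dimensions(n_residues):
    """Dimension of the backbone conformational space for n_residues amino acids."""
    heavy_atoms_per_residue = 4  # heavy backbone atoms per amino acid
    return {
        # three cartesian coordinates per modeled atom
        "cartesian": heavy_atoms_per_residue * 3 * n_residues,
        # three backbone dihedral angles per amino acid
        "three_dihedrals": 3 * n_residues,
        # assumption: one dihedral held fixed, two varied
        "two_dihedrals": 2 * n_residues,
    }

dims = backbone_dimensions(30)
print(dims)
```

For the 30-residue example this gives 360 cartesian dimensions against 60 in the smaller angular representation, which is the lower bound quoted above.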

The high-dimensionality of the search space favors cer- 
tain approaches to the problem of obtaining an ensemble 
of conformations near the native structure in a reasonable
amount of time. Methods based on the Molecular
Dynamics (MD) approach simulate the actual folding
process where a protein slowly tumbles down the energy 
surface from its unfolded to the folded native state. Simu- 
lating folding kinetics demands very small moves in the 
energy surface in order to retain accuracy when integrating
equations of motion. For this reason, MD-based
approaches demand significant computational resources
(e.g., Folding@Home) and/or specialized hardware (e.g.,
Anton) [8,9]. Conducting, instead, a global energy optimization
which forgoes information on folding kinetics is
useful and justified under the thermodynamics treatment. 
Approaches based on optimizing the energy of a confor- 
mation can obtain native conformations orders of magni- 
tude faster than approaches that simulate folding 
pathways [10]. Many of these approaches follow the 
Monte Carlo (MC) approach in order to enhance their 
sampling capability over the MD approach. MC-based 
approaches, however, still struggle to obtain native 
conformations on medium-size proteins due to the com- 
plexity of the protein energy surface [4]. Due to this sig- 
nificant challenge, stochastic optimization algorithms for 
protein conformational search remain a very active field
of research [11].

Many stochastic optimization techniques for ab-initio 
structure prediction have converged on a unifying strategy 
of sampling of a large number of low-energy conforma- 
tions. The emphasis on the size is due to the fact that 
many local minima may be present in the energy surface, 
particularly in those constructed by current functions 
available to measure the potential energy of a protein con- 
formation. Sampled conformations are the end points of 
independent MD or MC trajectories which perform a 
local optimization on a given coarse-grained energy func- 
tion. In full ab-initio protocols, stochastic optimization 
with a coarse-grained energy function constitutes only 
stage one. After the ensemble of low-energy conformations,
often referred to as decoys, is obtained, the decoy
ensemble is reduced in preparation for a second stage of
optimization. The reduction employs either filtering by 
energies or grouping by structural similarity through clus- 
tering-based techniques. The purpose of the reduction is 
to reveal a subset of conformations representing local 
minima that are worth optimizing further at greater struc- 
tural detail and through some finer-grained energy func- 
tion in order to improve their proximity to the native 
structure [5,12-17]. 

Optimization-based approaches are effective at obtaining
native conformations for many small to medium size
proteins; however, the accuracy of these approaches is
ultimately bounded by the accuracy of the employed energy
function. State-of-the-art energy functions employ 
approximations to improve performance, but these 
approximations can lead to errors in the energy function 
which are responsible for deviations between the global
minimum of the energy function for a particular protein
and the experimentally-determined native structure for the 
protein [10,18]. Because of these deviations, approaches 
which sample a broad range of low-energy minima, rather 
than focussing on a single global minimum, are more 
appropriate for coarse-grained energy functions. These 
sampled minima can then be further scrutinized through
additional heavy-duty optimization techniques.

In most MC-based methods, the broad view is obtained 
by launching many independent MC trajectories. In 
other approaches, the trajectories are integrated into a 
tree-based or population-based search framework, main- 
taining a broader view and thus a more diverse decoy 
ensemble by employing analysis of the ensemble to effec- 
tively guide the search towards relevant regions of the 
search space [19-22]. In robotics-inspired approaches, a 
tree of conformations grows in conformational space 
[19,20], and low-dimensional embeddings of the energy 
surface and conformational space are used to collect 
online statistics with which to adaptively bias the search 
towards low-energy regions and away from over-sampled 
regions. In evolutionary-inspired approaches [21,22], 
multi-objective analysis of energy terms is used to guide 
the search towards a diverse population of conformations.
Currently, this multi-objective analysis has been applied
only to all-atom representations and only to very
small proteins.

None of the above methods explicitly sample local 
minima in the energy surface. Rather, they rely on some 
post-analysis to group conformations together to identify 
captured local minima. Recent studies by us and others 
have proposed that Basin Hopping (BH) is a promising 
stochastic optimization framework to directly obtain a 
discrete representation of the protein energy surface in 
terms of local minima [23-25]. The framework was ori- 
ginally introduced to obtain the Lennard-Jones minima 
of small atomic clusters [26]. The inspiration for the BH 
framework in [26] comes from evolutionary search algo- 
rithms, such as Iterated Local Search (ILS). ILS consists 
of iterated applications of perturbation followed by local 
search and is popular for solving discrete optimization 
problems [27]. An adaptation of ILS for molecular modeling
introduces a Metropolis-like criterion to bias the
sampling of local minima towards lower-energy regions 
of the search space. 

Previous realizations of the BH framework, most notably
in the MC with Minimization algorithm, essentially differ
in their implementation of the perturbation and minimization
components [28,29]. The perturbation component
typically directly modifies the atomic coordinates of a
conformation, and minimization is performed through a gradient
descent or a low-temperature Metropolis MC trajectory.
Successful applications of BH algorithms include obtaining 
local minima of small atomic clusters, mapping the energy
surface of polyalanines, and modeling of other small proteins
[10,30-32].

In recent years, new attention has been given to BH as 
a framework for ab-initio protein structure prediction 
[23-25,33]. In [23], a particular realization of the BH 
framework on small proteins is shown to obtain both 
lower-energy minima and conformations closer to the
experimentally-determined native structure than the MD 
with Simulated Annealing approach. Here conformations 
are perturbed by modifying atomic coordinates by small 
random values. Minimization is then implemented as a 
gradient descent over a coarse-grained energy function. 
While effective on small proteins, the performance of this 
implementation decreases significantly on sequences with
more than 75 amino acids [23]. 

In recent work, we extend the effectiveness of BH to 
longer protein sequences by employing molecular frag- 
ment replacement with a coarse-grained energy function 
[24,25] (detailed in the Methods section). Experiments 
show that the resulting BH algorithm samples conformations
as close to experimentally-determined native
structures as those sampled by other state-of-the-art structure
prediction protocols. This proposed coarse-grained sampling
algorithm is intended to generate decoy conformations as 
the first step in a structure prediction protocol which then 
further refines selected decoy conformations. 

Given the recent attention and promise of BH as a 
framework for protein structure prediction, a greater 
understanding of the effectiveness and efficiency of the 
key BH components is critical. While some studies into 
the efficacy of different perturbation moves for identifying
low-energy isomers of small Si and Cu clusters exist
in the computational physics community [34], no such 
study is available for proteins. 

In this work we offer a detailed analysis of the BH fra- 
mework in the context of structure prediction. We provide 
an in-depth analysis of BH's two key components, pertur- 
bation and minimization, and show how adjusting these 
components affects sampling of decoy conformations. 
Controlling the magnitude of each perturbation allows us 
to directly control the distance between consecutively- 
sampled local minima. We show that this local-minima 
distance is directly related to the ability of the BH algo- 
rithm to effectively explore the conformational space and 
obtain conformations near the native protein structure. 
We also explore the use of temperature when employing 
Metropolis MC minimization and show that a shorter 
greedy search is just as effective as a more intensive 
Metropolis MC minimization. 

Our BH algorithm is effective at rapidly sampling 
large numbers of decoy conformations that represent 
local minima in the protein energy surface. Here we 
extend analysis of this decoy ensemble beyond simply 
comparing the decoys with the lowest lRMSD to the
experimentally-determined native structure. Realizing that
the true utility of a stochastic optimization technique is in 
which subset of its conformations would be retained for 
further refinement in a complete ab-initio protocol, we 
pursue different reduction techniques and analyze how 
each of those would retain near-native conformations 
sampled by the BH algorithm. 

We show, as expected, that ensemble reduction techni- 
ques based on total energy miss many promising near- 
native conformations. This is to be expected, as a method 
with high sampling capability will uncover many low- 
energy non-native conformations. Given the growing 
knowledge that current energy functions, particularly 
coarse-grained ones, are weakly funneled, displaying very
weak correlation between low energies and proximity to 
the native structure, no energetic threshold will discard 
non-native and retain near-native conformations. Our 
analysis shows this on 15 diverse protein systems. On the 
other hand, reduction techniques that discard energies 
and instead cluster conformations by structural similarity 
can be quite computationally demanding with large 
ensemble sizes (10^6 conformations or more). Such techniques
would also not be viable if they need to be applied
repeatedly during search.

We introduce here a novel energy-based ensemble 
reduction technique that makes use of multi-objective 
analysis to enhance retention of near-native decoys. The
technique decomposes the energy of each conformation 
into the various terms in the energy function and evalu- 
ates conformations based on Pareto count and the Pareto 
front. The analysis is particularly suited to finding a sub- 
set of conformations that satisfy conflicting terms, as is 
the case with terms added up in energy functions. We 
show that our Pareto-based selection scheme significantly 
reduces the size of the decoy ensemble, while retaining a 
more diverse set of near-native conformations than 
employing a total energy threshold. These results are 
shown to be robust and valid when using two different 
state-of-the-art coarse-grained energy functions com- 
monly employed in a structure prediction setting. The 
low computational cost of computing these multi-
objective metrics makes them practical, even on very
large ensembles of decoy conformations. Since the Pareto 
front and Pareto count can be computed online, these 
multi-objective energy metrics are also ideal to be 
employed in online analyses used by tree-based and 
population-based search algorithms to adaptively guide 
search. 

A preliminary investigation of this ensemble reduction 
technique was presented in [33]. In this work, we extend 
the BH framework to employ two different state-of-the-art 
energy functions and analyze the effectiveness of the
ensemble reduction technique on the energy surface
sampled by both energy functions. 



Methods 

Obtaining a broad view of the energy surface for a pro- 
tein sequence of interest in the coarse-grained stage relies 
on a stochastic optimization algorithm to go through dif- 
ferent conformations and an energy function to score 
these conformations and guide the search towards low- 
energy ones. As described in the Background section, 
coarse graining in this stage refers to the employment of 
a coarse-grained representation for the protein chain. As 
in many state-of-the-art ab-initio protocols, we employ 
an extended backbone representation in our BH-based 
algorithm, sacrificing side chains. This representation is 
detailed first below, in the Molecular representation sec- 
tion. Given a coarse-grained representation, a coarse- 
grained energy function scores conformations generated 
by the search algorithm. We consider here two state-of- 
the-art coarse-grained energy functions, the AMW and 
the Rosetta energy functions, briefly described below in 
the Coarse-grained energy function section. The BH- 
based stochastic optimization algorithm that makes use 
of the chosen representation and energy function(s) is 
described next, followed by details on the different imple- 
mentations considered and analyzed for its perturbation 
and minimization components. The implementations for 
the algorithmic components of the algorithm are analyzed
in detail for how they affect the quality of the
(decoy) ensemble of local minima produced by the algo- 
rithm. The Pareto-optimal filtering of this ensemble is 
described last. 

Molecular representation 

The structural detail in the side chains of a protein is lar- 
gely sacrificed in the interest of expediency. It is worth 
noting that once the decoy ensemble is obtained and 
reduced through selection techniques, structural detail is
added to the retained coarse-grained conformations through
side-chain packing techniques [35,36]. The AMW and the
Rosetta coarse-grained energy functions considered here 
and described below operate on slightly different extended 
backbone representations. In both cases, the backbone 
heavy atoms N, C, Cα, and O are explicitly modeled.
When using AMW, side-chains are reduced to only the
Cβ atom (with the exception of glycine, where there is no
such atom). When using Rosetta, a side chain is reduced 
to a pseudo-atom centered at the side chain's centroid. 

Cartesian coordinates for the atoms modeled are 
employed by the respective energy functions to associate a 
potential energy value or score with a generated confor- 
mation. Internally, the representation employed by the 
algorithm to generate conformations maintains only three 
backbone dihedral angles (φ, ψ, ω) per amino acid. This
angular representation, also known as a kinematic model, 
is based on the idealized geometry assumption, which 
fixes bond lengths and angles to idealized (native) values
(taken from CHARMM22 [37]) and limits variations to
backbone dihedral angles. Using this angular representation,
the BH algorithm essentially generates conformations
by replacing values for an entire block of φ, ψ, ω angles of
f consecutive amino acids at a time (f is often referred to
as the fragment length). New values for a block are
sampled from a fragment configuration library, which 
essentially stores blocks of angles observed in known 
native structures, as described in the Background section. 
After a conformation is obtained in its angular representa- 
tion, forward kinematics is employed to obtain cartesian 
coordinates for the modeled atoms from the backbone 
dihedral angles [38]. 

Coarse-grained energy function 

Our experiments in this paper consider two state-of-the- 
art coarse-grained energy functions, the Associative 
Memory Hamiltonian with Water (AMW), and the 
Rosetta energy function, described below. 
AMW energy function 

This coarse-grained potential, originally proposed in [39], 
has been used by us and others in the context of different 
search procedures for the purpose of decoy sampling in 
ab-initio structure prediction [12,19,20,40-42]. Briefly, 
AMW sums 5 non-local terms (local interactions are kept 
at ideal values under the idealized geometry assumption): 

$$E_{\text{AMW}} = E_{\text{Lennard-Jones}} + E_{\text{H-Bond}} + E_{\text{compaction}} + E_{\text{burial}} + E_{\text{water}}$$

The $E_{\text{Lennard-Jones}}$ term is implemented after the 12-6
Lennard-Jones potential in AMBER9 [43], allowing a soft
penetration of van der Waals spheres. The $E_{\text{H-Bond}}$ term
allows modeling hydrogen bonds and is implemented as in
[44]. The other terms, $E_{\text{compaction}}$, $E_{\text{burial}}$, and $E_{\text{water}}$, allow
formation of a hydrophobic core and water-mediated
interactions (see [12] for more details).
Rosetta energy function 

The Rosetta energy function we use here corresponds to 
the score3 setting in the suite of energy functions used in
the Rosetta ab-initio protocol [45]. The different energy 
functions used in the Rosetta ab-initio protocol are scaled 
versions of a full energy function that is a linear combina- 
tion of 10 terms. These terms measure repulsion, amino- 
acid propensities, residue environment, residue pair 
interactions, interactions between secondary structure ele- 
ments, density, and compactness. The different substages 
used in the Rosetta ab-initio protocol use subsets of the 
terms of the full energy function and modify weights in 
the linear combination to promote certain interactions 
over others. We use here the score3 setting, as this corre- 
sponds to the full coarse-grained Rosetta energy function. 

Probabilistic search algorithm based on basin hopping 
framework 

We first proposed the BH-based probabilistic search 
algorithm that we analyze in detail in this paper in [25].
The algorithm iteratively hops between consecutive
minima C_i and C_{i+1} by performing a perturbation followed
by a minimization. Conformation C_i is perturbed
to obtain a new higher-energy conformation C_{perturb,i},
which allows the search to escape from its current local
minimum. C_{perturb,i} is then minimized through a series
of small modifications until a new minimum C_{i+1} is
reached. The Metropolis criterion is then employed to
determine whether or not the current state of the trajectory
is moved to C_{i+1} based on the energetic difference
between C_i and C_{i+1}. This results in a trajectory of
conformations representing local minima in the energy surface.
The Metropolis criterion guides the trajectory
towards lower-energy regions of the energy surface. 
Thus, the ensemble of decoy conformations obtained 
with BH consists of good-quality conformations that 
represent local minima in the protein energy surface. 
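
The hop-by-hop procedure described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the implementation analyzed in this paper: the function names are hypothetical, the acceptance probability is taken to be the standard Metropolis form exp(-ΔE/T), every minimum reached is recorded whether or not the trajectory moves to it, and the one-dimensional energy, perturbation, and minimizer are toy stand-ins for a protein model.

```python
import math
import random

def basin_hopping(start, energy, perturb, minimize, temperature, n_steps, rng):
    """Return the list of local minima reached by one BH trajectory."""
    current = minimize(start)
    minima = [current]
    for _ in range(n_steps):
        # Perturb to escape the current minimum, then minimize again.
        candidate = minimize(perturb(current, rng))
        minima.append(candidate)  # assumption: every minimum reached is kept
        delta = energy(candidate) - energy(current)
        # Metropolis criterion on the energy difference between the two minima.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
    return minima

# Toy stand-ins (not a protein model): a rugged one-dimensional energy.
def toy_energy(x):
    return 0.1 * x * x + math.sin(3.0 * x)

def toy_minimize(x, step=0.01):
    # Greedy descent: step downhill until neither neighbor is lower.
    while True:
        best = min((x - step, x + step), key=toy_energy)
        if toy_energy(best) >= toy_energy(x):
            return x
        x = best

def toy_perturb(x, rng):
    return x + rng.uniform(-1.5, 1.5)

minima = basin_hopping(4.0, toy_energy, toy_perturb, toy_minimize,
                       temperature=0.5, n_steps=50, rng=random.Random(0))
print(len(minima), min(toy_energy(m) for m in minima))
```

Passing the perturbation and minimization as callables mirrors the way the two components are varied independently in the analysis that follows.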

The two main components in the algorithm are the 
perturbation and minimization. They both modify con- 
formations using the molecular fragment replacement 
technique described in the Background section. Briefly, 
given a conformation, a trimer (three consecutive amino 
acids) is selected at random over the target protein 
sequence. A configuration for that trimer (consisting of 9 
backbone dihedral angles: φ, ψ, and ω for each of the amino
acids in the trimer) is then obtained at random over the 
available ones in a fragment configuration library. The 
library is pre-compiled from configurations extracted 
from known non-redundant native structures. The frag- 
ment configuration library is constructed as in the proto- 
col outlined in the Rosetta ab-initio package (for further 
details, see Ref. [25]). While the perturbation replaces
one trimer configuration, the minimization consists of 
repeated replacements until a certain preset number of 
consecutive attempts fail to lower energy. 
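
A minimal sketch of trimer replacement and the greedy minimization built on it is given below. Several simplifications are assumptions made for illustration: a conformation is a list of (φ, ψ, ω) tuples, the fragment library is a flat list of trimer configurations that ignores the amino-acid sequence, and the energy is a toy function rather than AMW or Rosetta. All names are hypothetical.

```python
import random

def replace_trimer(conf, library, rng):
    """Copy conf and overwrite one randomly placed trimer with a library entry."""
    start = rng.randrange(len(conf) - 2)  # first residue of the chosen trimer
    new_conf = list(conf)
    new_conf[start:start + 3] = rng.choice(library)
    return new_conf

def greedy_minimize(conf, energy, library, max_failures, rng):
    """Repeat replacements until max_failures consecutive attempts fail to lower energy."""
    current, current_e = conf, energy(conf)
    failures = 0
    while failures < max_failures:
        trial = replace_trimer(current, library, rng)
        trial_e = energy(trial)
        if trial_e < current_e:
            current, current_e, failures = trial, trial_e, 0
        else:
            failures += 1
    return current

# Toy library: three trimer configurations, each repeating one (phi, psi, omega) triple.
HELIX = (-60.0, -45.0, 180.0)
STRAND = (-120.0, 130.0, 180.0)
LOOP = (-90.0, 0.0, 180.0)
toy_library = [[HELIX] * 3, [STRAND] * 3, [LOOP] * 3]

def toy_energy(conf):
    # Toy score only: squared deviation of (phi, psi) from the helical values.
    return sum((phi + 60.0) ** 2 + (psi + 45.0) ** 2 for phi, psi, _ in conf)

start = [STRAND] * 10
minimized = greedy_minimize(start, toy_energy, toy_library,
                            max_failures=30, rng=random.Random(1))
print(toy_energy(start), toy_energy(minimized))
```

A single call to `replace_trimer` corresponds to the perturbation move, while `greedy_minimize` corresponds to the repeated replacements of the minimization.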

In this work we propose and analyze different implementations
for the minimization and perturbation components,
paying attention to how they affect the quality of
the decoy ensemble. We do not explicitly analyze the efficacy
of different moves that one can employ in perturbation.
Comparative results between work in [23], which
applies small random perturbations to atomic coordinates,
and work in [25], which applies trimer configuration replacements,
suggest that the latter moves are more efficient
with growing sequence length and confer higher sampling
capability.

Perturbation 

In order to effectively explore the conformational space, 
the magnitude of the perturbation must be large enough 
to escape the current local minimum, but not so large that 
consecutively sampled local minima are too unrelated in 
the conformational space. If the perturbation magnitude is 
too small, the minimization step is likely to return to the
previous local minimum. Even if a new minimum is reached, if
the average distance between C_i and C_{i+1} is too small, then
the search will be too inefficient to cover the breadth of
the protein conformational space. If the perturbation magnitude
is too large, however, then the search effectively
samples local minima at random over the entire energy 
surface and cannot be effectively guided by the Metropolis 
criterion towards lower-energy regions. 

Perturbation is performed through a single trimer fragment
replacement on C_i to obtain C_{perturb,i}. Since the magnitude
of each perturbation (measured as the lRMSD
between C_i and C_{perturb,i}) varies based on which fragment
configuration is selected from the fragment library, the following
technique is employed to explicitly bias the magnitude
of each perturbation to a configured value D. For
each perturbation, a target magnitude d is sampled from a
Gaussian distribution centered at D with a standard deviation
of 1. New perturbed conformations are then sampled
through fragment replacement until a conformation
C_{perturb,i} is found which is d Å lRMSD from C_i (within a
tolerance t). If n attempts have been made without finding
a conformation which satisfies the target perturbation
magnitude, then the conformation which comes closest to
satisfying this target is used as C_{perturb,i}. The value of n is
set to 20, which is large enough that a conformation
C_{perturb,i} can be found within a tolerance of t = 0.5 Å in
nearly every case. Since only the final conformation
selected for C_{perturb,i} is evaluated for energy, this process of
sampling multiple perturbation candidates does not add
significant computational time to the overall algorithm.
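
The magnitude-targeting step can be sketched as follows. This is an illustration under stated assumptions: the candidate generator and the distance measure are passed in as callables (in the paper these would be a trimer replacement and lRMSD, neither of which is implemented here), and the names, the tolerance of 0.5, and the one-dimensional demo are hypothetical choices for the sketch.

```python
import random

def targeted_perturbation(conf, propose, distance, D, tol, max_attempts, rng):
    """Sample candidates until one lies within tol of a target distance from conf."""
    # Target magnitude drawn from a Gaussian centered at D with standard deviation 1.
    target = rng.gauss(D, 1.0)
    best, best_err = None, float("inf")
    for _ in range(max_attempts):
        candidate = propose(conf, rng)
        err = abs(distance(conf, candidate) - target)
        if err <= tol:
            return candidate  # target magnitude satisfied
        if err < best_err:
            best, best_err = candidate, err
    # No candidate met the tolerance: fall back to the closest one seen.
    return best

# Toy demo on the real line (not a protein model).
result = targeted_perturbation(
    0.0,
    propose=lambda c, r: c + r.uniform(-6.0, 6.0),
    distance=lambda a, b: abs(a - b),
    D=3.0, tol=0.5, max_attempts=20, rng=random.Random(0))
print(result)
```

Because no energy is evaluated inside the loop, the cost of the extra candidates is limited to generating them and measuring their distance to the current conformation.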

Minimization 

The minimization component maps a perturbed conformation
to a nearby local minimum in the protein energy
surface through a series of small modifications. Since the 
minimization step consumes the vast majority of the 
computational resources in a BH algorithm, it is impor- 
tant to balance the efficiency of a minimization technique 
with its effectiveness at probing local minima. In this 
work we compare the more computationally efficient 
greedy search, summarized above and implemented ori- 
ginally in [25], to a Metropolis MC (MMC) search for 
minimization. While MMC is more computationally 
intensive, it is able to probe deeper into local minima by 
adjusting the effective temperature of the search. We do 
not investigate gradient-based techniques, as they con- 
verge very slowly to a local minimum [23]. 

In a greedy search, only modifications (referred to here
as moves) which lower the energy of the conformation are
made. An MMC search, however, will occasionally
accept a move which raises the energy of the conformation
in order to cross over an energetic barrier. The height of
the energetic barrier which can be crossed is controlled by
the effective temperature, T, employed by the Metropolis
criterion. By setting T to a small non-zero value, the
MMC search can effectively jump over low energy barriers
while remaining in the same local energy funnel. This
allows an MMC search to reach deeper local minima than a
greedy search, which can get stuck on these low energetic
barriers.

Probing down to true local minima in the protein 
energy surface can be computationally intensive and ana- 
lysis of the AMW energy surface in previous work shows 
that experimentally-determined native structures are 
found somewhere above their corresponding true minima 
[25]. For this reason, each MMC minimization is run
only until k consecutive moves are rejected. For the purposes
of this study, the working definition of a local
minimum is thus determined by the value of k. Based on
previous work, k is set to the length of the target protein
sequence, which is sufficient to sample near-native conformations
[25].

The temperature parameter, T, effectively controls the
height of energy barriers which can be crossed by the 
MMC minimization. A higher value of T makes it less 
likely that the minimization will get stuck, and thus, on 
average, more MMC moves will be made before reaching 
the termination condition of k consecutive failed moves. 
In the special case where T = 0, the MMC search is effec- 
tively equivalent to the greedy search shown effective in 
our previous work [25]. In the Results section, we compare 
the effectiveness of greedy vs. MMC search in minimiza- 
tion. Three different effective temperatures are studied in 
the context of the MMC search. Temperatures T_0, T_1,
and T_2 correspond to a 0.1 probability of accepting energy
increases of 1.4, 1.7, and 2.6 kcal/mol, respectively.
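
To make the link between an effective temperature and an acceptance probability concrete, the sketch below assumes the standard Metropolis form, in which an energy increase ΔE is accepted with probability exp(-ΔE/T) and T is expressed in the same energy units as ΔE (the Boltzmann constant is absorbed into T). The paper does not spell out this form, so both it and the function names are assumptions.

```python
import math
import random

def temperature_for(delta_e, p_accept=0.1):
    """Effective temperature at which an increase of delta_e is accepted with p_accept."""
    # Solve exp(-delta_e / T) = p_accept for T.
    return -delta_e / math.log(p_accept)

def metropolis_accept(delta_e, temperature, rng):
    """Metropolis test; with temperature 0 only energy-lowering moves are accepted."""
    if delta_e < 0:
        return True
    if temperature <= 0:
        return False  # greedy special case
    return rng.random() < math.exp(-delta_e / temperature)

# Effective temperatures for the three energy increases quoted in the text.
temps = [temperature_for(e) for e in (1.4, 1.7, 2.6)]
print(temps)
```

Under this assumption each temperature is simply the corresponding energy increase divided by ln 10, so the three settings are ordered from the least to the most permissive.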

Multi-objective ensemble reduction 

The ensemble Ω of local minima that is obtained by the
BH-based algorithm under some chosen implementations 
of the perturbation and minimization components can be 
large. In a complete ab-initio structure prediction proto- 
col, a few promising coarse-grained structures are selected 
for refinement in greater structural detail. Therefore, the 
ensemble Ω produced by the BH framework must be
reduced to a relevant subset of local minima conforma- 
tions. Here a trade-off must be made between selecting a 
small number of conformations and selecting a diverse 
enough subset so as to increase the likelihood of retaining 
near-native conformations. 

A simple ensemble reduction technique which retains all 
conformations with an energy below a given threshold is 
problematic because there is no accepted method for 
selecting an appropriate threshold for an arbitrary protein 
system. Using the threshold method, it is likely that the 
reduced ensemble will either be too large to make fine- 
grained refinement practical or that many near-native conformations
will be excluded due to noise in the energy
function (recall that current energy functions are all
weakly funneled and thus the global minimum may not correspond
to the native structure). By comparing energy terms
individually, however, a more nuanced energetic comparison
can help remove some of the noise inherent in energy
functions which results from the weighted linear combination
of unrelated energy terms [46]. This multi-objective
analysis is the foundation of the technique we propose and
analyze here to reduce Ω.

A conformation C_i is said to dominate a conformation C_j when every energy term in C_i is lower than the corresponding term in C_j. C_j is said to be non-dominated if there is no conformation in Ω that dominates C_j. Conformations in the non-dominated ensemble, referred to as the Pareto front, are considered equivalent with respect to a multi-objective analysis. Figure 1 illustrates the Pareto front for a simplified energy function containing only two terms.

When every term in C_i is less than the corresponding term in C_j, C_i is said to strongly dominate C_j. If the requirement for dominance is relaxed such that every term in C_i is less than or equal to its corresponding term in C_j, this is referred to as weak dominance. Typically, multi-objective analysis employs strong dominance; however, in some cases weak dominance may be more appropriate, particularly if one of the energy terms has a very low variance.
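The two notions of dominance and the resulting Pareto front can be sketched in a few lines of Python. The sketch assumes each conformation is represented only by its tuple of energy terms; the function names are illustrative. For weak dominance it additionally requires that the two energy vectors differ, so that identical conformations do not eliminate each other; that condition is an assumption of this sketch and is not stated in the text.

```python
def dominates(a, b, strong=True):
    """Return True if energy vector a dominates energy vector b.

    strong=True : every term of a is lower than the corresponding term of b.
    strong=False: every term of a is lower than or equal to the corresponding
                  term of b, and the vectors are not identical (assumption).
    """
    if strong:
        return all(x < y for x, y in zip(a, b))
    return all(x <= y for x, y in zip(a, b)) and tuple(a) != tuple(b)


def pareto_front(ensemble, strong=True):
    """Indices of the conformations that no other conformation dominates.

    Quadratic-time scan, kept simple for clarity."""
    front = []
    for i, candidate in enumerate(ensemble):
        dominated = any(
            dominates(other, candidate, strong)
            for j, other in enumerate(ensemble)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical two-term energies, in the spirit of a two-term illustration.
example = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0), (5.0, 5.0)]
front = pareto_front(example)
```

In this toy ensemble the conformations at indices 0, 1, and 2 form the Pareto front, since (3.0, 3.0) and (5.0, 5.0) are both dominated by (2.0, 2.0).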

Figure 1 Illustration of the Pareto front and Pareto count for an energy function with two terms. C2 strongly dominates 4 conformations and weakly dominates 1 additional conformation; thus the Pareto count of C2 is 4 for strong Pareto dominance and 5 for weak Pareto dominance.



Membership in the Pareto front is a binary state. It is often desirable to employ multi-objective analysis to rank conformations whether or not they lie in the Pareto front. One such metric is the Pareto count of a conformation. The Pareto count of C_i measures the number of other conformations that C_i dominates. Pareto count is illustrated in Figure 1.

This work employs multi-objective analysis as a method for filtering the Ω ensemble of conformations representing local minima. The ensemble Ω_PF corresponds to conformations that lie in the Pareto front and Ω_PC(n) corresponds to conformations with a Pareto count above a given threshold value. The variable n is set to a particular percentage of Ω, and a Pareto count threshold is chosen such that |Ω_PC(n)| = n · |Ω|. For example, Ω_PC(5%) represents the 5% of conformations in Ω with the highest values for Pareto count.
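The construction of Ω_PC(n) can be sketched as follows in Python, under the assumption that each conformation is given as a tuple of energy terms. The function names, the tie-breaking by original order, and the rounding of n · |Ω| to the nearest integer are illustrative choices and are not specified in the text.

```python
def dominates(a, b, strong=True):
    """Strong: every term of a is lower than the corresponding term of b.
    Weak: every term lower or equal, and the vectors differ (assumption)."""
    if strong:
        return all(x < y for x, y in zip(a, b))
    return all(x <= y for x, y in zip(a, b)) and tuple(a) != tuple(b)


def pareto_counts(ensemble, strong=True):
    """Pareto count of each conformation: how many others it dominates."""
    return [
        sum(
            1
            for j, other in enumerate(ensemble)
            if j != i and dominates(conf, other, strong)
        )
        for i, conf in enumerate(ensemble)
    ]


def reduce_by_pareto_count(ensemble, fraction, strong=True):
    """Indices of the top `fraction` of conformations ranked by Pareto count.

    Ties are broken by original order (Python's sort is stable)."""
    counts = pareto_counts(ensemble, strong)
    keep = max(1, round(fraction * len(ensemble)))
    ranked = sorted(range(len(ensemble)), key=lambda i: counts[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical two-term energies for five conformations.
example = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (0.0, 5.0), (4.0, 4.0)]
counts = pareto_counts(example)
selected = reduce_by_pareto_count(example, 0.4)
```

Here the conformation (0.0, 5.0) is non-dominated, so it would lie in the Pareto front, yet its Pareto count is 0; the two metrics therefore rank conformations differently.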

Results and discussion 

Experimental setup

The analysis is conducted over 15 target protein systems listed in Table 1, which range from 61 to 123 amino acids in length and cover the α, β, and α/β folds. Experiments are run for a fixed budget of 10,000,000 energy function evaluations. Since over 90% of CPU time is spent on such evaluations, the limit ensures a fair comparison between different parameter selections on a diverse set of proteins. Computing 10,000,000 energy function evaluations takes 1-4 days of CPU time on a 2.4 GHz Core i7 processor, depending on protein length. The perturbation and minimization components are analyzed first in the Analysis of BH framework section with respect to the AMW energy function. Lastly, the Multi-objective ensemble reduction section presents results for Ω ensembles obtained by running the BH framework with both the AMW and Rosetta energy functions.

Analysis of BH framework 

Analysis is performed on the effect of biasing perturbation distance and varying the temperature of the local search in the BH framework.

Biasing perturbation distance

Our previous work shows a direct correlation between the mean lRMSD between consecutive local minima (referred to from now on as μ_|MM|) and the ability of the BH framework to sample near-native conformations [25]. Figure 2 shows that μ_|MM| can be effectively controlled by biasing the magnitude of the perturbation jump through a target perturbation distance D; as D is increased, there is a corresponding increase in μ_|MM|. Tuning D does not significantly affect the single lowest lRMSD conformation sampled (lRMSD measures the proximity of a conformation to the experimental native structure and is computed over the
heavy backbone atoms). However, in cases where unbiased perturbation results in large μ_|MM| values, changing D does affect how frequently near-native conformations are sampled (that is, the distribution of sampled minima). Figure 3 illustrates this for two representative systems by plotting, for different values of D, the distribution of μ_|MM| values and the resulting distribution of lRMSD values. These results show that there is a distinct advantage to biasing the perturbation distance to D = 1 Å or D = 2 Å. Figures 3(a) and 3(c) show that the frequency of small μ_|MM| is larger when D ∈ {1, 2} Å vs. an unbiased perturbation. Figures 3(b) and 3(d) show that the resulting ensembles contain more low-lRMSD conformations than the unbiased approach.

Table 1 local search.

                                  Lowest energy (kcal/mol)          Lowest lRMSD (Å)
 #  PDB id  Size  Fold  %α  %β     T=0      T0      T1      T2     T=0    T0    T1    T2
 1  1dtdB     61  α/β   15  46  -128.2  -132.1  -131.6  -127.9     6.9   6.6   6.9   7.0
 2  1isuA     62  α/β   15  19  -127.8  -130.3  -130.7  -130.2     6.3   6.0   6.4   6.0
 3  1c8cA     64  α/β   22  48  -133.5  -134.8  -130.8  -129.6     6.5   6.6   7.4   7.3
 4  1sap      66  α/β   30  44  -132.8  -132.3  -133.6  -127.3     6.5   6.0   6.8   6.9
 5  1hz6A     67  α/β   31  42  -143.5  -144.7  -142.1  -138.9     5.7   5.9   6.0   6.0
 6  1wapA     68  β      0  62  -118.4  -127.2  -133.9  -127.9     7.4   7.6   7.4   7.5
 7  1fwp      69  α/β   30  26  -152.8  -152.0  -143.5  -143.2     6.3   6.7   6.5   6.1
 8  1ail      70  α     84   0  -170.6  -171.0  -167.3  -168.4     3.2   3.2   3.4   3.3
 9  1aoy      78  α/β   41  10  -183.9  -181.2  -180.8  -184.1     5.7   6.4   6.0   6.4
10  1cc5      83  α     47   4  -170.9  -171.5  -179.1  -173.8     5.8   5.7   5.8   5.8
11  2ezk      93  α     68   0  -217.3  -218.6  -224.4  -216.0     4.3   4.6   4.2   4.4
12  1hhp      99  β      7  48  -168.7  -175.4  -179.0  -175.9    10.4  10.4  10.0  10.5
13  2hg6     106  α/β   34  21  -233.6  -236.8  -239.5  -235.1     8.8   9.0   8.8   9.2
14  3gwl     106  α     70   0  -264.6  -270.4  -273.9  -267.3     4.9   4.9   4.4   5.2
15  2h5nD    123  α     71   2  -307.8  -313.0  -316.5  -313.2     7.5   7.9   7.4   8.1

Columns 2-4 show the native PDB id, size, and fold topology for each of the 15 target protein systems. Columns 5 and 6 break the fold topology down as the percentage of amino acids which are part of α-helices and β-sheets. Columns 7-10 report the minimum energy achieved for each temperature T of the minimization component of the BH framework. Columns 11-14 then report the corresponding lowest lRMSD to the native structure achieved for each T.



The effect of controlling D shown in Figure 3 is strongest on more heavily β-sheet proteins (those with native PDB ids 1dtdB, 1isuA, 1wapA, and 1hhp). On these proteins, an unbiased perturbation results in few small consecutive local minima distances. More near-native conformations are also obtained (though to a lesser extent) when D ∈ {1, 2} Å for other proteins (with native PDB ids 1ail, 1sap, and 2h5nD). On these proteins, unbiased perturbation results in larger numbers of small consecutive local minima distances, but these proteins still benefit from enhanced sampling of neighboring local minima.

This enhanced sampling of near-native conformations can correspond to the BH search remaining in the same near-native region of the space; low D values could potentially cause the minimization to return to the previous minimum. In practice, this does occur for D = 1 Å; however, when D > 1 Å, the search returns to previous local minima the same or less frequently than the unbiased approach.

Figure 3 The frequencies of μ_|MM| sampled during the search for proteins with native structure PDB ids 1ail and 1isuA are shown in (a) and (c), respectively. Frequencies of lRMSDs to the native structure for each protein are given in (b) and (d), respectively. The solid red line represents BH employing the unbiased perturbation method. The dashed lines represent BH with median perturbation distances D = 1 Å to D = 5 Å.

MMC versus greedy search in minimization 

Table 1 compares the greedy search (T = 0) to MMC searches with temperatures T0, T1, and T2. The lowest energies achieved under each setting are shown in columns 7-10. Results show that employing MMC as the minimization step achieves lower-energy conformations than employing greedy search. In general, MMC with T = T0 achieves the lowest energy values for proteins less than 80 amino acids in length, while the lowest energies are achieved by the slightly higher temperature of T1 for longer proteins. This is possibly because, on more complex rugged surfaces, small uphill moves allow reaching deeper minima.

The energy surface sampled by the BH framework for each given value of T is illustrated in Figure 4. The x and y-axes represent geometric projections of the conformations based on interatomic distances, and the z-axis represents the energy of each sampled local minimum. The geometric projections are based on the mean interatomic distances between selected atoms (see [19] for more details). A large white "x" represents the location of the experimentally-determined native structure. Figure 4 illustrates that coarse-grained energy functions are noisy and result in surfaces that can deviate from the true protein energy surface. Columns 11-14 in Table 1 show, for each value of T, the lowest lRMSD to the native structure over Ω. The lowest lRMSD values obtained are comparable whether greedy or MMC search is employed in the minimization. This suggests that MMC minimization's ability to probe deeper into minima does not necessarily bring the BH search closer to the native structure.

The higher computational cost of each MMC minimization results in fewer sampled minima (total number of energy evaluations is fixed). Employing MMC in place of greedy search thus reduces the total number of hops in the BH trajectory by 50 to 70%, resulting in correspondingly fewer sampled minima. Columns 11-14 in Table 1, however, show that a lower number of sampled minima does not necessarily correlate with worse proximity to the native state. Focusing on a smaller ensemble of "interesting" local minima allows more computationally intensive refinement steps to focus resources more effectively.

Figure 4 The energy surface sampled for the protein with native PDB id 1fwp is shown for each temperature T. The x and y-axes represent projection coordinates based on interatomic distances within each conformation, and the z-axis represents the energy of each sampled local minimum. The white "x" indicates the location of the native structure in the energy surface.

Multi-objective ensemble reduction 

Multi-objective ensemble reduction proposed in the Methods section is evaluated by comparing its ability to retain near-native conformations to that of employing a threshold based on total energy. The use of the Pareto front and the Pareto count as metrics for ensemble reduction is evaluated in the "Pareto front reduction technique" and "Pareto count reduction technique" sections, respectively. To further evaluate the effectiveness of the multi-objective reduction technique, results are given for both the AMW energy function and the Rosetta coarse-grained energy function with "score3" weights. The ensembles Ω_AMW and Ω_Rosetta are generated for each target protein with the BH framework described in Methods employing unbiased perturbation and T = 0 for minimization.

The total energy for each conformation in Ω is decomposed into individual energy terms described in Methods. Since multi-objective analysis is highly sensitive to the number of energy terms, the Rosetta energy terms are then combined into 5 groups so the number of terms is consistent between Ω_AMW and Ω_Rosetta in the multi-objective analysis. Grouping is done based on correlation between energy terms; more highly correlated terms are combined. In this work, the following energy term groupings are employed: {env, pair, cbeta, rg}, {vdw}, {cenpack}, {hs_pair}, {ss_pair, rsigma, sheet}. Since the terms ss_pair, rsigma, and sheet are primarily employed in the evaluation of beta sheets, their values often remain fixed for proteins without beta sheets or for proteins in which beta sheets are not accurately modeled. If one term remains fixed, then it is impossible for one conformation to dominate another using strong Pareto dominance as described in the Multi-objective ensemble reduction section. Therefore, weak dominance is employed when performing multi-objective analysis on Ω_Rosetta.
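The effect of a fixed energy term on dominance can be made concrete with a short Python sketch. The groupings are those quoted above; combining the terms of a group by summation, the function names, and the numerical values of the two conformations are assumptions made purely for illustration.

```python
# Groupings quoted in the text; summing the terms within a group is an
# assumption of this sketch.
ROSETTA_GROUPS = (
    ("env", "pair", "cbeta", "rg"),
    ("vdw",),
    ("cenpack",),
    ("hs_pair",),
    ("ss_pair", "rsigma", "sheet"),
)


def group_terms(terms):
    """Collapse a dict of individual energy terms into one value per group."""
    return tuple(sum(terms[name] for name in group) for group in ROSETTA_GROUPS)


def strongly_dominates(a, b):
    """Every term of a is lower than the corresponding term of b."""
    return all(x < y for x, y in zip(a, b))


def weakly_dominates(a, b):
    """Every term of a is lower than or equal to the corresponding term
    of b; identical vectors are excluded (assumption)."""
    return all(x <= y for x, y in zip(a, b)) and tuple(a) != tuple(b)


# Two hypothetical conformations of a protein without beta sheets: the
# sheet-related terms are fixed at 0.0 in both.
better = {"env": -12.0, "pair": -3.0, "cbeta": 1.0, "rg": 20.0, "vdw": 0.5,
          "cenpack": -1.0, "hs_pair": -0.5,
          "ss_pair": 0.0, "rsigma": 0.0, "sheet": 0.0}
worse = {"env": -10.0, "pair": -2.0, "cbeta": 2.0, "rg": 22.0, "vdw": 1.5,
         "cenpack": -0.5, "hs_pair": 0.0,
         "ss_pair": 0.0, "rsigma": 0.0, "sheet": 0.0}

g_better, g_worse = group_terms(better), group_terms(worse)
strong = strongly_dominates(g_better, g_worse)
weak = weakly_dominates(g_better, g_worse)
```

Although the first conformation is lower in every group that varies, the fixed beta-sheet group prevents strong dominance, while weak dominance still ranks the two conformations.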

Tables 2 and 3 compare the ensemble reduced through a total energy threshold, Ω_TE(n), to the ensembles reduced by employing the Pareto front, Ω_PF, and the Pareto count, Ω_PC(n), for the AMW and Rosetta energy functions. The ensemble Ω_TE(n) is achieved by selecting a total energy threshold and removing all conformations with total energy greater than the threshold. The variable n is set to a particular percentage of Ω, and a total energy threshold is chosen such that |Ω_TE(n)| = n · |Ω|. Recall that the ensemble Ω_PC(n) is constructed similarly to Ω_TE(n); however, the Pareto count is employed in place of total energy to rank conformations. For Ω_PF, only conformations in the non-dominated Pareto front are retained. For Ω_PC(n) and Ω_TE(n), n can be set to any percentage of Ω, while the size of Ω_PF is dictated by the size of the Pareto front for a given Ω.
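The total energy baseline and the lRMSD-based comparison can be sketched in Python as follows. The function names are illustrative, the rounding of n · |Ω| to the nearest integer is an assumption, and the energies and lRMSD values are hypothetical numbers chosen only to show the bookkeeping.

```python
def reduce_by_total_energy(total_energies, fraction):
    """Indices of the `fraction` of conformations with the lowest total
    energy, i.e. those at or below the implied energy threshold."""
    keep = max(1, round(fraction * len(total_energies)))
    ranked = sorted(range(len(total_energies)), key=lambda i: total_energies[i])
    return sorted(ranked[:keep])


def matched_fraction(front_size, ensemble_size):
    """Fraction r for which the energy-reduced ensemble has the same size
    as the Pareto front, so that the two can be compared fairly."""
    return front_size / ensemble_size


def min_lrmsd(indices, lrmsds):
    """Lowest lRMSD to the native structure among the retained indices."""
    return min(lrmsds[i] for i in indices)


# Hypothetical ensemble of four conformations.
energies = [-10.0, -30.0, -20.0, -5.0]   # total energies (kcal/mol)
lrmsds = [4.0, 9.0, 7.5, 6.0]            # lRMSD to native (angstroms)

retained = reduce_by_total_energy(energies, 0.5)
best_retained = min_lrmsd(retained, lrmsds)
r = matched_fraction(front_size=2, ensemble_size=len(energies))
```

In this toy case the two lowest-energy conformations are retained, and the conformation closest to the native structure (lRMSD 4.0) is lost; this is the situation the comparison between energy-based and Pareto-based reductions is designed to expose.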

Pareto front reduction technique 

Column 3 in Tables 2 and 3 shows that, when considering only conformations in the Pareto front, Ω_PF, the size of Ω is reduced by over 90% across all target proteins and by at least 95% for the majority of proteins. This shows that the Pareto front filter is a highly effective method for [...] conformations. The difference between the average size of Ω_PF employing AMW and Ω_PF employing Rosetta is [...] weak dominance [...] Rosetta.

[...] lRMSD to the native structure of all conformations in Ω, [...] Ω_TE(n=r) and Ω_PF, respectively. Here r is chosen such that |Ω_TE(n=r)| = |Ω_PF| so a fair comparison can be made.

Table 2 AMW multi-objective reduction technique.

AMW Energy Function

            Ω_PF reduction                         Minimum lRMSD (Å)
 #  PDB id  (r = |Ω_PF|/|Ω|)     Ω  Ω_TE(r)  Ω_PF  Ω_TE(5%)  Ω_TE(10%)  Ω_PC(5%)  Ω_PC(10%)
 1  1dtdB          ?             ?     ?       ?       ?         ?          ?          ?
 2  1isuA          ?             ?     ?       ?       ?         ?          ?          ?
 3  1c8cA          ?             ?     ?       ?       ?         ?          ?          ?
 4  1sap           ?             ?     ?       ?       ?         ?          ?          ?
 5  1hz6A          2%          5.9    6.7     6.3     6.7       6.7        6.7        6.6
 6  1wapA          2%          7.7    8.7     8.7     8.7       8.7        8.7        8.7
 7  1fwp           7%          6.4    8.1     7.3     8.1       8.1        8.1        8.1
 8  1ail           2%          3.4    6.8     5.9     5.8       4.2        4.7        4.4
 9  1aoy           6%          5.7    6.9     6.6     6.9       6.5        6.8        6.5
10  1cc5           7%          5.6    8.6     7.0     8.7       8.6        8.6        8.1
11  2ezk           3%          4.4    8.0     7.3     7.7       7.1        7.2        7.1
12  1hhp           1%         10.7   12.0    12.0    11.6      11.6       11.6       10.8
13  2hg6           6%          8.6   10.8    10.5    11.6      10.8       10.9       10.8
14  3gwl           5%          4.2    4.7     5.2     4.7       4.7        4.7        4.7
15  2h5nD          7%          7.9   10.7    10.0    10.8      10.4       10.4       10.4

The minimum lRMSD to the native structure retained by each of the proposed multi-objective ensemble reduction techniques is given for the Ω generated with the AMW energy function. Column 3 gives the size of the Pareto front as a percentage of the size of Ω. Column 4 gives the minimum lRMSD to the native structure of any conformation in Ω. Columns 5 and 6 give the minimum lRMSD retained by Ω_TE(r) and Ω_PF, respectively, where r is the corresponding value from Column 3. Columns 7-10 compare the minimum lRMSD retained by Ω_TE(n) and Ω_PC(n) for thresholds of n = 5% and n = 10%.






While neither ensemble reduction technique is able to retain the lowest-lRMSD conformations from Ω, comparison of columns 5 and 6 reveals that Ω_PF retains conformations with lRMSDs to native not higher than Ω_TE(n=r) for all but two proteins when employing the AMW energy function (Table 2) and for all proteins when employing the Rosetta energy function (Table 3). This difference in lRMSD is significant (0.5 Å or greater) for proteins with native PDB ids 1fwp, 1ail, 1cc5, 2ezk, and 2h5nD for AMW and 1dtdB, 1c8cA, 1ail, 1aoy, 2ezk, 1hhp, 2hg6, and 2h5nD for Rosetta.

Table 3 Rosetta multi-objective reduction technique.

Rosetta Energy Function

            Ω_PF reduction                         Minimum lRMSD (Å)
 #  PDB id  (r = |Ω_PF|/|Ω|)     Ω  Ω_TE(r)  Ω_PF  Ω_TE(5%)  Ω_TE(10%)  Ω_PC(5%)  Ω_PC(10%)
 1  1dtdB          1%          6.7   10.8     9.1    10.6      10.2       10.2        8.6
 2  1isuA          2%          6.5    8.9     8.6     8.9       8.6        8.0        7.5
 3  1c8cA          2%          5.6    7.9     7.1     7.8       7.0        7.1        6.8
 4  1sap           3%          6.1    7.4     7.1     7.4       6.8        6.8        6.6
 5  1hz6A          3%          2.5    2.8     2.8     2.8       2.6        2.7        2.6
 6  1wapA          1%          7.4    8.8     8.8     8.5       8.5        8.8        8.1
 7  1fwp           3%          6.1    7.2     7.0     7.1       7.1        7.2        6.9
 8  1ail          >1%          4.8    8.2     6.2     7.6       7.5        7.5        6.9
 9  1aoy           2%          6.2   10.1     9.1     9.2       9.2        9.3        9.2
10  1cc5           1%          5.0    6.3     6.3     5.7       5.7        5.5        5.4
11  2ezk           1%          3.9    9.1     6.2     5.2       5.1        5.1        4.9
12  1hhp           3%         10.8   13.9    12.6    13.9      13.6       13.0       12.9
13  2hg6           2%         10.6   12.2    11.5    12.0      12.0       12.0       11.7
14  3gwl           1%          7.1    8.9     8.5     8.7       8.4        8.0        7.8
15  2h5nD          1%          8.9   13.0    10.4    12.3      12.1       12.2       11.4

The minimum lRMSD to the native structure retained by each of the proposed multi-objective ensemble reduction techniques is given for the Ω generated with the Rosetta energy function. Column 3 gives the size of the Pareto front as a percentage of the size of Ω. Column 4 gives the minimum lRMSD to the native structure of any conformation in Ω. Columns 5 and 6 give the minimum lRMSD retained by Ω_TE(r) and Ω_PF, respectively, where r is the corresponding value from Column 3. Columns 7-10 compare the minimum lRMSD retained by Ω_TE(n) and Ω_PC(n) for thresholds of n = 5% and n = 10%.

Merely looking at the minimum lRMSD to the native structure retained does not tell the entire story. Figures 5(d) and 6(d) plot the energy versus lRMSD to native for each conformation in Ω for the AMW and Rosetta energy functions, respectively, for a representative protein with native PDB id 1sap. Conformations in Ω_PF are highlighted in dark blue and a dashed line represents the energy cutoff for Ω_TE(n=r). For both energy functions, Ω_PF retains lower lRMSD to native conformations than Ω_TE(n=r), and Ω_TE(n=r) loses significantly more of these near-native conformations. These results show that there is a clear advantage to employing the Pareto front over a total energy threshold to select conformations from Ω, and these results hold whether employing AMW or Rosetta.

Figure 6(e) represents an unusual case (illustrated by the protein with native PDB id 1hz6A) where the correlation between total energy and lRMSD to native is very high. High correlation is rarely the case for coarse-grained energy functions. We have specifically chosen to show 1hz6A here because Rosetta seems to capture well the true energy surface for this protein. For 1hz6A, a total energy threshold alone is sufficient for selecting decoy conformations with low lRMSDs, given this high correlation. In a blind prediction, the native structure is unknown and thus lRMSDs are not available. Thus, such cases are difficult to identify, and the Pareto front is still just as effective as a total energy threshold.

Pareto count reduction technique

Unlike Ω_PF, the size of Ω_PC(n) can be set for any desired value of n. Figures 5(a)-(c) (AMW energy function) and 6(a)-(c) (Rosetta energy function) show the minimum lRMSD to native for Ω_PC(n) (dashed red line) and Ω_TE(n) (solid black line) for n ∈ {1, 2, 3, ..., 100} on three selected proteins with PDB ids 1sap, 1hz6A, and 2ezk. The minimum lRMSD and size of Ω_PF are also given for reference as a blue "X". Examination reveals that Ω_PC(n) retains conformations with lRMSDs to the native structure as low or lower than Ω_TE(n) for values of n ≤ 10% for all three proteins. This result is representative of all 15 target proteins investigated in this study. Columns 7-10 of Tables 2 and 3 give the minimum lRMSD for Ω_TE(n) and Ω_PC(n) for n = 5% and n = 10% for all 15 target proteins.

Figures 5(g)-(i) and 6(g)-(i) plot the energy versus lRMSD to native for each conformation in Ω for the AMW and Rosetta energy functions, respectively, for the same three representative proteins (PDB ids 1sap, 1hz6A, and 2ezk). Conformations in Ω_PC(5%) and Ω_PC(10%) are highlighted in blue and red, respectively. The dashed blue and red lines represent the total energy cutoffs for Ω_TE(5%) and Ω_TE(10%), respectively. Examination of the common case of 1sap reveals that Ω_PC(n) retains significantly more low-lRMSD conformations than Ω_TE(n) for a given value of n. In the unusual case of 1hz6A, for which total energy is highly correlated with lRMSD, Ω_PC(n) retains a similar range of low-lRMSD structures as Ω_TE(n) does.

The protein with PDB id 2ezk represents a case where Ω_PF is not effective at retaining low-lRMSD structures. Figures 5(f) and 6(f) show that the low-lRMSD conformations retained by Ω_PF are outliers, particularly for the Rosetta energy function. Examination of Figures 5(i) and 6(i) reveals that, for this difficult case, Ω_PC(n) is still effective at sampling a range of low-lRMSD conformations. A similar result is seen for the protein with PDB id 1ail (data not shown here).

Taken together, these results show that employing 
multi-objective analysis to filter the output ensemble pro- 
vides a distinct advantage over a total energy criterion. 
The ensemble size reduction is dramatic, yet non-outlier 
low-lRMSD conformations are still retained. In difficult 
cases the Pareto count metric retains low-lRMSD confor- 
mations even when the Pareto front does not. 

Conclusions 

This work shows that careful realizations of the BH framework can provide both rapid sampling and enhanced sampling of the protein conformational space. In addition to previous work, where a simple realization of the BH framework was shown competitive in terms of obtaining lowest lRMSDs to the native structure comparable to state-of-the-art MC-based methods [25], this work shows the high sampling capability and the diversity of the decoy ensemble obtained by BH-based algorithms. We draw attention to the ability of the algorithm to obtain many non-native conformations of low energies, which is a hallmark of algorithms with high sampling capability [47,48].

This work provides a deeper understanding of the BH framework and its premise for obtaining an effective decoy ensemble. The two algorithmic components of the framework, perturbation and minimization, are analyzed in detail, and effective implementations are offered to control the exploration for the purpose of obtaining a diverse decoy ensemble. Results show that the distance between consecutively-sampled local minima is directly affected by the perturbation distance. Our experiments demonstrate that by biasing perturbation distance, one can enhance sampling of near-native decoys in the BH framework. Moreover, a simple greedy search was shown just as effective at sampling near-native conformations as a more computationally intensive MMC trajectory.

Figure 5 Results for each of the proposed multi-objective ensemble filtering methods are shown for the AMW energy function on three representative proteins with native PDB ids 1sap, 1hz6A, and 2ezk. (a)-(c) show the minimum lRMSD to the native structure retained from the full ensemble Ω in the reduced ensembles Ω_PC(n) (dashed red line) and Ω_TE(n) (solid black line), for a given percentage n of the conformations in Ω. The minimum lRMSD retained by Ω_PF is marked with a blue "X". (d)-(f) show the total energy versus lRMSD to the native structure for each conformation in the ensemble Ω. Conformations corresponding to the Pareto front, Ω_PF, are colored in dark blue. The dashed line represents the energy cutoff such that |Ω_TE(n)| = |Ω_PF|. In (g)-(i), conformations are colored according to their Pareto count. Conformations in Ω_PC(n) are colored in blue and red for n = 5% and n = 10%, respectively. The dashed lines represent the total energy cutoff for conformations in Ω_TE(n).

Employing short greedy searches for minimization is appealing, as it allows sampling a significantly larger number of local minima than longer MMC trajectories. This larger ensemble provides a broad view of low-energy local minima in the coarse-grained energy surface, but inaccuracies in the energy function do not allow relating near-native conformations with the lowest-energy minima. To deal with this issue, we present an ensemble reduction technique based on multi-objective analysis. Metrics based on the Pareto front and Pareto count are proposed, and analysis is performed on the decoy ensemble generated by our BH framework employing either the AMW or the Rosetta coarse-grained energy functions.

Figure 6 Results for each of the proposed multi-objective ensemble filtering methods are shown for the Rosetta coarse-grained energy function on three representative proteins with native PDB ids 1sap, 1hz6A, and 2ezk. (a)-(c) show the minimum lRMSD to the experimentally determined native structure retained from the full ensemble Ω in the reduced ensembles Ω_PC(n) (dashed red line) and Ω_TE(n) (solid black line), for a given percentage n of the conformations in Ω. The minimum lRMSD retained by Ω_PF is marked with a blue "X". (d)-(f) show the total energy versus lRMSD to the native structure for each conformation in the ensemble Ω. Conformations corresponding to the Pareto front, Ω_PF, are colored in dark blue. The dashed line represents the energy cutoff such that |Ω_TE(n)| = |Ω_PF|. In (g)-(i), conformations are colored according to their Pareto count. Conformations in Ω_PC(n) are colored in blue and red for n = 5% and n = 10%, respectively. The dashed lines represent the total energy cutoff for conformations in Ω_TE(n).



For all proteins investigated in this work, the Pareto-based reduction technique is highly effective at reducing the ensemble while still maintaining non-outlier near-native conformations. Multi-objective metrics based on Pareto dominance are an ideal choice because they can be computed online and have lower computational complexity than structure-based clustering algorithms. Future work will investigate this setting to further enhance sampling capability while retaining an informative conformational ensemble.